docs(simd): W1a consumer contract — 5 primitive specs + VPABSB correction#149
Conversation
…tion
Captures the consumer-side architectural contract that the AdaWorldAPI/
lance-graph spine is staging against this ndarray fork. A PRE-MERGE
audit of lance-graph main on 2026-05-16 surfaced 158 raw-intrinsic
violations across 5 consumer crates plus 3 missing primitives here
that block clean remediation. This doc is the spec for what the
W1a primitive queue must do; consumer-side migrations cannot proceed
until these primitives ship.
Files (2):
1. .claude/knowledge/vertical-simd-consumer-contract.md (NEW, 328L)
The pattern: struct methods on typed wrappers + closure-
parameterized batch primitives. Consumers see zero raw intrinsics
and zero arch-specific cfg; the polyfill (this repo) owns runtime
feature dispatch, lane chunking, tail handling, scalar fallback.
**VPABSB correction (P0):** _mm512_abs_epi8 does NOT saturate
i8::MIN — it returns the same bit pattern (0x80 → 0x80, which is
still -128 when interpreted as i8). The lance-graph PR #400 codex
P1 review caught the original claim that VPABSB saturates by ISA;
the correct AVX-512 saturating_abs is
_mm512_min_epu8(_mm512_abs_epi8(x), _mm512_set1_epi8(0x7f))
The VPMINUB clamp remaps 0x80 (unsigned 128 > 127) down to 0x7f.
NEON vqabsq_s8 is already hardware-saturating (q-suffix); scalar
i8::saturating_abs is correct.
Five W1a primitive specs with per-arch implementation hints,
API surfaces, mandatory parity-test requirements, and consumer
call sites:
- TD-NDARRAY-SIMD-UNPACK-I4-16D: I8x16::from_i4_packed_u64 +
batch_packed_i4_16<E, F> closure-batch (consumer:
lance-graph mul::i4_eval::batch, 5 fns)
- TD-NDARRAY-SIMD-SATURATING-ABS-I8: I8x16::saturating_abs (the
VPABSB+VPMINUB fix above)
- TD-NDARRAY-SIMD-GATHER: U16x8::gather_u16 + palette_lookup_u8x8
(consumer: bgz17/src/simd.rs:88)
- TD-NDARRAY-SIMD-PREFETCH: prefetch_read_t0/t1/t2 cross-arch
(consumer: bgz17/src/prefetch.rs:96,100)
- TD-NDARRAY-SIMD-POPCOUNT-U64: U64x8::popcnt + U64x8::xor_popcount
(consumer: holograph/hamming.rs, blasgraph/types.rs)
Three W1.5 deferred primitives (gated on lance-graph:crates/sigker
benchmarking + jc Pillar 11 activation): signature-PDE-sweep,
randomized-projection, lyndon-pack. Mentioned so W1a additions
are designed broad enough to compose with these later.
Acceptance criteria: all three backends mandatory, parity test
mandatory, saturating/overflow semantics documented, no new
is_*_feature_detected! outside simd_caps, // SAFETY: on all
unsafe blocks, consumer site cited in PR description.
Cross-links to lance-graph knowledge doc, simd-savant card,
E-SIMD-SWEEP-1 epiphany, TD-NDARRAY-SIMD-* / TD-SIMD-SWEEP-W*
debt ledger, PR #398/#399/#400 history. Plus Intel + ARM
intrinsic references, Hambly-Lyons 2010, Cuchiero-Schmocker-
Teichmann 2021, Jirak 2016.
2. CLAUDE.md § Hard Rules (+1 row)
"All new public pub fn in src/simd_*.rs follows the W1a consumer
contract" — pointer to the doc, callout that VPABSB does NOT
saturate i8::MIN, and statement of the gating relationship
(missing primitives here force consumer-side raw-intrinsic
violations, so additions here are blocking the consumer sweep).
Once this lands, W1a workers can spawn against this branch in
parallel: 5 small PRs against ndarray master, each implementing
one primitive from the queue with the parity tests mandated by
this doc.
https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7c2161b4a0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ``` | ||
|
|
||
| **Per-arch implementation:** | ||
| - **AVX2/AVX-512:** `_mm256_i32gather_epi32` with index widening + downcast (caveat: `_mm256_i32gather_epi32` reads 32 bits per index; for u16 values pack two indices per gather slot, or downcast post-gather). |
There was a problem hiding this comment.
Avoid reading past u16 tables with 32-bit gathers
For gather_u16, using _mm256_i32gather_epi32 against a &[u16] can read four bytes starting at table[index]; when an allowed index is table.len() - 1, that crosses the end of the slice even though the later bounds contract only requires max(indices) < table.len(). If a future implementation follows this hint, release builds can perform an out-of-bounds SIMD load for valid last-element lookups instead of safely returning the final u16.
Useful? React with 👍 / 👎.
| **Per-arch implementation:** | ||
| - **AVX-512 VPOPCNTDQ:** `_mm512_popcnt_epi64` directly. Feature flag `avx512vpopcntdq`. | ||
| - **AVX-512 without VPOPCNTDQ:** fallback via `_mm512_sad_epu8` on a per-byte popcount LUT (Mula's algorithm using VPSHUFB). | ||
| - **NEON:** `vcntq_u8` for byte popcount, then horizontal sum within each u64 via `vaddvq_u8` or `vpaddlq_u8` cascade. |
There was a problem hiding this comment.
Preserve u64 lanes in the NEON popcount recipe
For lane-wise U64x8::popcnt, vaddvq_u8 reduces all bytes in a NEON vector to one scalar, so using it as described here would merge the counts of multiple u64 lanes instead of returning one count per lane. This only shows up on the NEON backend, where a future implementation following this contract would disagree with the scalar and AVX results for inputs such as [1, 0, ...] versus [0, 1, ...]; the hint should require a per-u64 widening/pairwise reduction instead.
Useful? React with 👍 / 👎.
Resolves add/add conflict on .claude/knowledge/vertical-simd-consumer-contract.md by taking master's version (PR #149 — the polished READ BY / P0 TRIGGERS form with agent routing). CLAUDE.md gains the W1a contract hard rule pointer from master. https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo
Captures the consumer-side architectural contract that the
AdaWorldAPI/lance-graphspine is staging against this ndarray fork. A PRE-MERGE audit of lance-graph main on 2026-05-16 surfaced 158 raw-intrinsic violations across 5 consumer crates plus 3 missing primitives here that block clean remediation. This doc is the spec for what the W1a primitive queue must do; consumer-side migrations cannot proceed until these primitives ship.Files
.claude/knowledge/vertical-simd-consumer-contract.md(NEW, 328 LOC) — the canonical W1a spec.CLAUDE.md§ Hard Rules (+1 line) — pointer + VPABSB callout + gating-relationship statement.The pattern
ndarray's SIMD surface is designed AS-IF for our exact workloads: struct methods on typed wrappers (
I8x16,U8x32,F32x16,U64x8, …) plus closure-parameterized batch primitives that absorb the consumer's domain semantics. Consumers see zero raw intrinsics, zerocfg(target_arch), zero runtime feature-detect — they callI8x16::from_i4_packed_u64(...),I8x16::saturating_abs(...),batch_packed_i4_16(..., |lanes, aux| { ... }). The polyfill (this repo) owns dispatch, chunking, tail handling, and scalar fallback.P0: VPABSB correction
The original
lance-graphPR #400 capture claimed_mm512_abs_epi8saturatesi8::MIN → 127by ISA. This is wrong — VPABSB returns the same bit pattern for0x80(i.e.,abs(i8::MIN) = i8::MIN, since+128doesn't fit ini8). Codex caught this on PR #400; the binding correction is:The VPMINUB (unsigned-byte min) clamp remaps
0x80(= 128 unsigned > 127) down to0x7f. All other lanes are unchanged sinceabs(x) < 0x80forx ≠ i8::MIN. NEONvqabsq_s8is already hardware-saturating (theqsuffix); scalari8::saturating_absis correct.Mandatory parity test:
The widen-then-negate trick used in
lance-graphPR #398's mul.rs is NOT a substitute — the new primitive must produce saturating semantics in the byte-wide register without widening, since downstream consumers will rely on byte-wide semantics for tight i4/i8 packed loops.W1a queue — 5 primitives this PR specs
Each will be a separate PR (parallel review, tight scope):
TD-NDARRAY-SIMD-UNPACK-I4-16DI8x16::from_i4_packed_u64+batch_packed_i4_16<E, F>closure-batchlance-graph::mul::i4_eval::batch(5 fns)TD-NDARRAY-SIMD-SATURATING-ABS-I8I8x16::saturating_abs(VPABSB + VPMINUB clamp)TD-NDARRAY-SIMD-GATHERU16x8::gather_u16+palette_lookup_u8x8bgz17/src/simd.rs:88TD-NDARRAY-SIMD-PREFETCHprefetch_read_t0/t1/t2cross-archbgz17/src/prefetch.rs:96,100TD-NDARRAY-SIMD-POPCOUNT-U64U64x8::popcnt+xor_popcountholograph/hamming.rs,blasgraph/types.rsW1.5 — deferred primitives (gated on lance-graph::sigker certification)
Three more queued behind
jc Pillar 11activation:signature-PDE-sweep,randomized-projection,lyndon-pack. The W1a additions must be designed broad enough to compose with these — in particular, the closure-batch shape introduced in W1a-#1 is the foundation for W1.5-#7 (randomized signatures).Acceptance criteria for each W1a PR
is_*_feature_detected!outsidesrc/hpc/simd_caps.rs// SAFETY:comments on allunsafeblocksThe
simd-savantagent on the lance-graph side runs PRE-MERGE against every W1a PR to verify compliance.Cross-references
AdaWorldAPI/lance-graphPR #399 — introduced thesimd-savantagent + autoattended-multiagent pattern docAdaWorldAPI/lance-graphPR #400 — architectural capture (alien-magic + EPIPHANIES + TECH_DEBT)AdaWorldAPI/lance-graphPR #398 — codex P1 (NEON OOB) + P2 (i8::MIN divergence) — the triggerAdaWorldAPI/lance-graphPR (open) — corrects the VPABSB claim inlance-graph's knowledge doc_mm512_abs_epi8,_mm512_min_epu8,_mm512_popcnt_epi64,_mm256_i32gather_epi32vqabsq_s8), VCNT (vcntq_u8)https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS
Generated by Claude Code